Conversation
…tion/aggregation

- Update QMoE schema with new optional input 14 (`router_weights`)
- Implement CPU provider support: when `router_weights` is provided, `router_probs` is used only for Top-K expert selection, and `router_weights` values at the selected expert indices are used for output aggregation
- Add not-implemented guards for the CUDA and WebGPU providers
- Update OperatorKernels.md documentation
- Add a test case for QMoE with separate `router_weights`

Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
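The routing behavior described in this commit can be sketched as follows. This is an illustrative, single-token sketch, not the actual ORT kernel: the `Routing`/`RouteToken` names are hypothetical, and the fallback path here reuses raw `router_probs` values where the real kernel applies a softmax.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical sketch of the decoupled routing for one token:
// `router_probs` drives Top-K expert selection; `router_weights`
// (when given) supplies the mixing weights gathered at the
// selected expert indices.
struct Routing {
  std::vector<int> experts;    // selected expert indices
  std::vector<float> weights;  // mixing weights for those experts
};

Routing RouteToken(const std::vector<float>& router_probs,
                   const std::vector<float>& router_weights,  // may be empty
                   int k, bool normalize) {
  const int num_experts = static_cast<int>(router_probs.size());
  std::vector<int> order(num_experts);
  std::iota(order.begin(), order.end(), 0);
  // Top-K selection is always based on router_probs.
  std::partial_sort(order.begin(), order.begin() + k, order.end(),
                    [&](int a, int b) { return router_probs[a] > router_probs[b]; });
  Routing r;
  for (int i = 0; i < k; ++i) {
    const int e = order[i];
    r.experts.push_back(e);
    // Aggregation weights come from router_weights when provided;
    // otherwise fall back to router_probs (the real kernel softmaxes here).
    r.weights.push_back(router_weights.empty() ? router_probs[e]
                                               : router_weights[e]);
  }
  if (normalize) {  // optional normalize_routing_weights-style normalization
    const float sum = std::accumulate(r.weights.begin(), r.weights.end(), 0.0f);
    if (sum != 0.0f)
      for (float& w : r.weights) w /= sum;
  }
  return r;
}
```

For example, with `router_probs = {0.1, 0.7, 0.2}` and `router_weights = {0.5, 0.25, 0.25}`, Top-2 selection picks experts 1 and 2 by `router_probs`, but their mixing weights come from `router_weights`.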
…on test case

Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
@copilot, please investigate the test errors in the CI pipelines and fix them.
Without `swiglu_fusion=1`, the default (0) causes `fc1_out_features` to be `inter_size` instead of `2*inter_size`. The SwiGLU activation reads `2*inter_size` values from the FC1 output buffer, causing out-of-bounds reads and NaN output on some platforms (e.g., Windows x86).

Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
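The out-of-bounds mechanism described above can be sketched as follows. This is an illustrative SwiGLU over a concatenated `[gate | linear]` FC1 output, not the actual ORT kernel (which may lay out or interleave the halves differently); the function name and layout are assumptions for clarity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative SwiGLU: it consumes 2*inter_size values per token from
// the FC1 output (a gate half and a linear half). If FC1 only produced
// inter_size values, the reads at fc1_out[inter_size + i] below would
// run past the end of the buffer, which is the NaN source described above.
std::vector<float> SwiGLU(const std::vector<float>& fc1_out, std::size_t inter_size) {
  // fc1_out must hold 2*inter_size values: [gate | linear].
  assert(fc1_out.size() >= 2 * inter_size);
  std::vector<float> out(inter_size);
  for (std::size_t i = 0; i < inter_size; ++i) {
    const float g = fc1_out[i];                     // gate half
    const float l = fc1_out[inter_size + i];        // linear half
    const float silu = g / (1.0f + std::exp(-g));   // SiLU(gate)
    out[i] = silu * l;
  }
  return out;
}
```

With `swiglu_fusion=1`, FC1 produces the full `2*inter_size`-wide buffer, so both halves of each read are in bounds.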
The NaN was caused by a missing `swiglu_fusion=1` attribute in the test. Fixed in 200c402 by adding it.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description
Adds optional input `router_weights` (index 14) to `com.microsoft.QMoE` to decouple Top-K expert selection from output aggregation weighting.

When `router_weights` is provided:
- `router_probs` → Top-K expert selection only
- `router_weights` → values gathered at the selected expert indices are used as mixing weights

When omitted, the existing softmax-of-`router_probs` behavior is preserved (backward compatible).

Changes:
- `contrib_defs.cc`: New optional input 14 `router_weights`, type T, shape `(num_tokens, num_experts)`
- `moe_quantization_cpu.cc`: Implements the separate routing path with MLFloat16/float support and optional `normalize_routing_weights` normalization
- `moe_quantization.cc`: Reads the input, raises not-implemented if provided
- `qmoe.cc`: Same not-implemented guard
- `moe_test.cc`: `QMoETest_CPU_RouterWeights` covering both normalized and unnormalized paths, with non-zero expected outputs via FC2 bias to validate correct aggregation weights
- `OperatorKernels.md`: Updated CPU and CUDA entries

This pattern matches DeepSeek-V2/V3/R1 routing, where
`sigmoid(logits)` is used for aggregation while `logits + bias` with group masking drives selection.

Motivation and Context
`QMoE` previously required the same tensor for both routing and weighting, blocking DeepSeek-style `noaux_tc` MoE models where these are intentionally separate. This unblocks ONNX Runtime export/serving of DeepSeek-V2/V3/R1 MoE architectures.

Original prompt